DL1 writer tool (ctapipe-stage1-process) #1163
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##           master    #1163   +/-   ##
=======================================
  Coverage   91.11%   91.11%
=======================================
  Files         183      183
  Lines       12542    12542
=======================================
  Hits        11428    11428
  Misses       1114     1114
```
Continue to review the full report at Codecov.
Just a quick question to understand the context better: how different is `ctapipe-stage1-process` from protopipe's `write_dl1.py`?
The DL1 structure is completely different.
Related to #1165
It will replace it in the next major protopipe release. But it's not fully the same thing yet: for example, there is only one cleaning in this version, while in protopipe there are two (the second keeping only the biggest island), and thus two sets of parameters. So we need to think how to do that in a nice way (perhaps always have a default set called "parameters/", and any additional ones are just new datasets like "parameters_bigisland/").
That's for efficiency: the data from each telescope type has the same shape, so why not put it in the same table and avoid the overhead of many datasets? If you want to plot something per telescope, you still have the tel_id column. In the lower levels you perhaps treat each telescope separately, but there isn't a good reason to in DL1, since those differences should be calibrated away (except for inter-telescope calibration). Really, I would like to have all parameters in a single table and all images in another, but since, at least for the images, the size changes with telescope type, that's not possible. For the parameters, the last revision did exactly that (and just allowed you to split by tel_type_id if you wanted; we could go back to that as well). I guess it depends on whether we want to later do benchmarks by individual telescope or by type? I would guess most are by type, so it's easier to have them already combined that way.
@vuillaut Though I suppose you are right: in real data we'd split by tel_id, not type, so perhaps it is easier to do that. It's an easy change, but it then makes plotting e.g. some quantity for each telescope type more annoying, since you have to merge all telescope tables first. Then again, merging is also not too hard, since a single telescope would write a table called LST_LST_LSTCam.
Maybe I'll make an option to split by telescope type vs. telescope id. For training or benchmarking, type is most useful, but for real data, id is better... (but then we have two variants of the data model, and I'm not sure that is nice either). Perhaps just an API function for easily converting between the two would be OK (e.g. something that merges all the per-telescope tables into a single per-type table on request).
I think this would be a bad approach, for exactly the reasons you give. Would storing a table per telescope id be that inefficient, considering we generally have a fixed number of telescopes? Alternatively, just a high-level method in the DL1Reader to obtain the table for a tel_id might be sufficient.
There is overhead for each table (metadata, etc.), but I'm not sure how much that is. If you think of this in database terms, nobody would ever store multiple tables with the same schema, since that's hard to manage; they would just invent a new index. But here, since merging would be easier (just linking the telescope datasets into a single file), it may be the right way to go.
A third possibility would be to use HDF5 variable-length arrays in just a single table. From the data point of view, I like this the most. I don't know what the performance impact will be.
We tried that a while back, and the performance was pretty bad - you lose a lot of the HPC features. It also makes it hard to read anything with pandas, etc. |
Yes, that's what I meant by an API function. I think it may be the right way to go, but it means you can't read everything directly with pandas.
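To make that concrete, here is a rough sketch of what such a conversion helper could look like (the function name and the `/dl1/parameters` group layout are assumptions for illustration, not the actual data model):

```python
import tables
import pandas as pd


def merge_telescope_tables(filename, group='/dl1/parameters'):
    """Concatenate all per-telescope tables under `group` into one DataFrame.

    Assumes every child table shares the same schema, so rows can simply
    be stacked; a per-type merge would filter the node list first.
    """
    with tables.open_file(filename) as f:
        frames = [
            pd.DataFrame(node.read())  # structured array -> DataFrame
            for node in f.iter_nodes(group, classname='Table')
        ]
    return pd.concat(frames, ignore_index=True)
```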
Reading arrays with pandas is always bad; it does not matter whether the rows have the same shape or different lengths. This is what I tried:

```python
import tables
import numpy as np
from tables import IsDescription, UInt64Col, UInt16Col


class DL1(IsDescription):
    array_event_id = UInt64Col()
    telescope_id = UInt16Col()


telescope_events = [
    (1, 1, np.random.normal(size=1440)),
    (1, 2, np.random.normal(size=1039)),
    (1, 3, np.random.normal(size=1855)),
    (2, 3, np.random.normal(size=1855)),
    (2, 4, np.random.normal(size=1855)),
    (3, 1, np.random.normal(size=1440)),
]

with tables.open_file('test.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    table = f.create_table('/dl1', 'telescope_events', DL1)
    images = f.create_vlarray('/dl1', 'images', tables.Float32Atom())

    for event_id, tel_id, img in telescope_events:
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row.append()
        images.append(img)
```
This has the advantage that you can read the data that has the same shape for all telescopes nicely using pandas, and for the images you get lists of numpy arrays, which seems not that bad compared to multiple tables.
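For reference, reading that mixed layout back is straightforward; a small sketch against the file written above:

```python
import tables
import pandas as pd

with tables.open_file('test.hdf5') as f:
    # the fixed-shape columns come back as a structured array,
    # which pandas converts directly into a DataFrame
    df = pd.DataFrame(f.root.dl1.telescope_events.read())

    # the VLArray reads back as a plain Python list of 1-D numpy arrays
    images = f.root.dl1.images.read()
```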
For me, reading the variable-length images is a little faster than reading from three same-shape tables. The variable-length array file is slightly larger; I don't know why. Here is the code:

```python
import tables
import numpy as np
from tables import IsDescription, UInt64Col, UInt16Col, Float32Col

# create some pseudo data
np.random.seed(0)
num_pixels = [1855, 1039, 2048]
telescope_types = [0] * 4 + [1] * 10 + [2] * 20
telescope_ids = np.arange(len(telescope_types))
n_events = 100

telescope_events = []
for event_id in range(n_events):
    n_telescopes = min(2 + np.random.poisson(10), len(telescope_ids))
    triggered = np.random.choice(telescope_ids, n_telescopes, replace=False)
    for tel_id in triggered:
        size = num_pixels[telescope_types[tel_id]]
        telescope_events.append((
            event_id, tel_id, np.random.normal(size=size)
        ))


# write out as one table with metadata and one variable length array column
class DL1(IsDescription):
    array_event_id = UInt64Col()
    telescope_id = UInt16Col()


with tables.open_file('vlarray.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    table = f.create_table('/dl1', 'telescope_events', DL1)
    images = f.create_vlarray('/dl1', 'images', tables.Float32Atom())

    for event_id, tel_id, img in telescope_events:
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row.append()
        images.append(img)


# write out as same-shape tables for each telescope type
def description(num_pixel):
    class DL1(IsDescription):
        array_event_id = UInt64Col()
        telescope_id = UInt16Col()
        image = Float32Col(shape=(num_pixel,))

    return DL1


with tables.open_file('tables.hdf5', mode='w') as f:
    f.create_group('/', 'dl1')
    type_tables = {}  # renamed from `tables` to avoid shadowing the module

    for event_id, tel_id, img in telescope_events:
        tel_type = telescope_types[tel_id]
        if tel_type not in type_tables:
            desc = description(num_pixels[tel_type])
            type_tables[tel_type] = f.create_table('/dl1', f'type_{tel_type}', desc)

        table = type_tables[tel_type]
        row = table.row
        row['array_event_id'] = event_id
        row['telescope_id'] = tel_id
        row['image'] = img
        row.append()
```
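As a sanity check on the timing claim above, a crude sketch of how the two layouts could be compared (wall-clock only, against the files written by the snippet above):

```python
import time
import tables


def time_read(filename, reader, repeats=10):
    """Average wall-clock time for a full read of one file."""
    start = time.perf_counter()
    for _ in range(repeats):
        with tables.open_file(filename) as f:
            reader(f)
    return (time.perf_counter() - start) / repeats


t_vl = time_read('vlarray.hdf5', lambda f: f.root.dl1.images.read())
t_tables = time_read(
    'tables.hdf5',
    # iterate over the per-type tables and read each image column
    lambda f: [node.col('image') for node in f.root.dl1],
)
print(f'vlarray: {t_vl:.4f} s, per-type tables: {t_tables:.4f} s')
```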
Trying to compare the advantages and drawbacks of both approaches, here is what I can come up with: ...

Of course, I like loading all my parameters at once, but I am not sure this outweighs the drawbacks.
There is also the compression issue. It seems varlen arrays are not (well) compressible. |
That might well be due to compression.
I did not enable any compression for either. |
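For completeness, enabling compression in PyTables is a single `Filters` argument (a sketch; zstd is available through the blosc meta-compressor). Note that, as far as I recall, the filters do not apply to the ragged payload of a VLArray, only to its metadata, which would fit the observation that varlen arrays don't compress well:

```python
import tables

# zstd via the blosc meta-compressor, one of the complibs PyTables supports
filters = tables.Filters(complevel=5, complib='blosc:zstd')

# filters passed at file creation become the default for all new nodes
with tables.open_file('compressed.hdf5', mode='w', filters=filters) as f:
    f.create_group('/', 'dl1')
    # ... create the tables/arrays as in the snippets above ...
```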
stage1: Use zstd compression by default
Improve defaults in stage1 config example
…nto feature/dl1_tool
@kosack I'm working on a fix for the extractor issue
Hi.
The data model looks great. I actually don't have any more comments on it (we know it will evolve in the future, but I think it encompasses all the exchanges we had).
I have a couple of suggestions for improvements, but I don't think they are show-stoppers.
I will continue digging today.

```python
"""
Generate DL1 (a or b) output files in HDF5 format from {R0,R1,DL0} inputs.
...

# TODO: add event time per telescope!
```
still TODO?
```python
)
...

# convert all values to strings, since hdf5 can't handle Times, etc.:
# TODO: add activity_stop_time?
```
still TODO?
no, done
```
The config file should be in JSON or python format (see traitlets docs). For an
example, see ctapipe/examples/stage1_config.json in the main code repo.
```
Note for later: getting this config file from the online Tool would be nice for users.
What do you mean? Exporting the used config? That is supported by the Tool itself, isn't it?
I mean having a way to dump the exhaustive config file without having to download the (non-exhaustive?) example from the repository. It's a proposed improvement, nothing blocking of course.
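Since the tool is a traitlets `Application`, dumping an exhaustive, commented config template is already one call away; a sketch (the import path and class name here are assumptions):

```python
# hypothetical import path and class name for the stage1 tool
from ctapipe.tools.stage1 import Stage1ProcessorTool

tool = Stage1ProcessorTool()
# generate_config_file() is standard traitlets.config.Application API:
# it emits every configurable trait with its help text and default value
print(tool.generate_config_file())
```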
```python
image_criteria = self.check_image(image_selected)
self.log.debug(
    "image_criteria: %s",
    list(zip(self.check_image.criteria_names[1:], image_criteria)),
```
Would it be interesting to write the criteria status, event by event, as columns in the DL1 parameters table? Overkill?
This looks like a combination of noise and cleaning. That fits with most parameters here.
We could apply the same cleaning to the true image before calculating the true parameters, but that would be problematic when using a cleaning method that relies on time, as we don't have a peak time for the true image right now.
Yes, you probably have to plot those in bins of total intensity to better understand the distributions. The true_hillas parameters are just the ones computed from the noise-free image, so at low intensity you're dominated by noise. Since this is a power-law spectrum, the plots you make are mostly dominated by low-energy/low-intensity images. I don't think this is a problem with the code itself.
That would be a solution indeed, but we could also argue that the signal extraction (and thus the cleaning) is part of the benchmark. There is no wrong way to look at it; we just need to keep it in mind.
Woohoo
Overview:

This is a PR to start a discussion, containing a fully working DL1 writer tool that writes a proposed standard DL1 HDF5 format. It is currently all contained in `ctapipe/tools/stage1.py`. This PR is provided so we can discuss the general workflow, usage, and data format produced, but please see the caveats below.

Caveats:

Currently there are a lot of extra features included in the `stage1.py` file that were needed to get everything to work properly:

- an `ImageCleaner` class
- a `DataChecker` class (for performing cuts and tracking statistics)
- `Container` classes, and a few algorithms to fill them
- parts of the `Stage1Process` Tool could also be moved to common places (e.g. functions to write the SubarrayDescription in HDF5 format could go in a file like `ctapipe/io/hdf5.py`)

These should all eventually move to standard places in ctapipe, but for now they are included here so that they can slowly be replaced by separate PRs that put them where needed. I suggest we avoid discussion/review of the implementations of those pieces until their subsequent PRs, but the overall idea of them can be discussed here if needed.

There are currently no unit tests for these features (they will come with the PRs mentioned above).

Ideally, once we are ready to merge this PR, it will be simplified down to a bare minimum, and the rest will already be part of ctapipe.
Usage:

If you want to test running this, you need to check out this branch and run `make develop` (so that the new executable `ctapipe-stage1-process` is generated), then run the executable and follow the instructions.
You can either specify a lot of things on the command-line, or better yet use a config file similar to the sketch below.
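As an illustration of the layout (here in the python format; the class and trait names are placeholders, see `ctapipe/examples/stage1_config.json` in the repo for a real example):

```python
# traitlets python-format config: assign to the magic `c` object.
# Class/trait names below are placeholders, not the tool's actual traits,
# apart from write_images, which mirrors the --write-images option
# mentioned in this PR.
c.Stage1ProcessorTool.write_images = True      # hypothetical class name
c.EventSource.input_url = 'events.simtel.gz'   # illustrative input file
```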
Currently the config file overrides anything on the command line (this is how `traitlets.config` works, but it is not ideal), so it is best to leave anything you would want to change on the command line out of the config file.

Output:
The output can be viewed using the `vitables` GUI, or accessed using the `tables` (PyTables) python module (for the DL1 data), or the Astropy `Table` or Pandas (for the configuration data).

In ViTables it looks like this (if you enable the `--write-images` option to get the DL1a data as well as the parameters):

Example access to the image data:
Here, the x-axis is pixel_id and the y-axis is event count.